fix(datasets): Add shuffle option to IidPartitioner by WilliamLindskog · Pull Request #7385 · flwrlabs/flower

WilliamLindskog · 2026-06-15T20:41:03Z

What changed

Add optional shuffle and seed parameters to IidPartitioner
Preserve the existing contiguous-slice behavior by default (shuffle=False)
Cache the shuffled dataset per partitioner instance so repeated partition loads stay stable, including when seed=None
Document the sorted-local-dataset case where shuffling before IID partitioning avoids label-skewed partitions
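The sorted-local-dataset case in the last bullet can be illustrated with a small pure-Python sketch. It does not use flwr_datasets: `contiguous_partitions` is a hypothetical helper, and the toy label list stands in for a local dataset that is sorted by label.

```python
import random
from collections import Counter


def contiguous_partitions(items, num_partitions):
    """Split items into equal contiguous slices (hypothetical helper).

    Assumes len(items) is divisible by num_partitions to keep the sketch short.
    """
    size = len(items) // num_partitions
    return [items[i * size:(i + 1) * size] for i in range(num_partitions)]


# Toy stand-in for a dataset sorted by label: four rows each of labels 0, 1, 2.
labels = [label for label in range(3) for _ in range(4)]

# Contiguous slicing of sorted data: every partition holds a single label.
unshuffled = contiguous_partitions(labels, 3)

# Shuffling once with a fixed seed before slicing will typically mix labels.
shuffled_labels = labels[:]
random.Random(42).shuffle(shuffled_labels)
shuffled = contiguous_partitions(shuffled_labels, 3)

for name, parts in (("contiguous", unshuffled), ("shuffled", shuffled)):
    print(name, [dict(Counter(part)) for part in parts])
```

With sorted input, the contiguous split yields one label per partition, which is the label-skewed outcome the documentation change warns about.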

Issue/PR mapping

Fixes [feature]: Support shuffling in IIDPartitioner or update documentation #7327
Supersedes the focused shuffle behavior in fix(datasets): Support shuffling in IidPartitioner #7329

Validation

pytest, ruff, mypy, and black --check on the touched IID partitioner files
git diff --check

Copilot

Pull request overview

This PR enhances flwr_datasets by (1) adding optional shuffling to IidPartitioner while preserving the historical contiguous-shard default behavior, and (2) introducing public partition skew distance metrics (Hellinger and Jensen–Shannon) to quantify how partition label/target distributions differ from the full dataset.

Changes:

Extend IidPartitioner with shuffle/seed and cache the shuffled dataset per instance for stable repeated loads.
Add compute_hellinger_distances and compute_jensen_shannon_distances (with optional binning for continuous targets) plus test coverage.
Update README and Sphinx docs to reflect the new IidPartitioner signature and new skew metrics.
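For readers unfamiliar with the two skew metrics named above, the following is a minimal sketch of the underlying math for discrete label distributions. The helper names are hypothetical and are not the PR's `compute_hellinger_distances` or `compute_jensen_shannon_distances`, whose signatures and binning options are not shown here; the base-2 logarithm is also an assumption.

```python
import math
from collections import Counter


def label_distribution(labels, classes):
    """Relative frequency of each class in labels (hypothetical helper)."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in classes]


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2)


def jensen_shannon_distance(p, q):
    """Square root of the base-2 Jensen-Shannon divergence, in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = (kl(p, m) + kl(q, m)) / 2
    return math.sqrt(max(0.0, divergence))  # guard against rounding below zero


# Compare one label-skewed partition against the full dataset.
full = label_distribution([0, 0, 1, 1], classes=[0, 1])
partition = label_distribution([0, 0], classes=[0, 1])
print(hellinger_distance(partition, full))
print(jensen_shannon_distance(partition, full))
```

Both functions return 0 for identical distributions and 1 for distributions with no overlapping classes, so larger values indicate a partition that deviates more from the full dataset.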

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| `datasets/README.md` | Documents `IidPartitioner(shuffle, seed)` usage and introduces partition skew metrics in the library overview/quickstart. |
| `datasets/flwr_datasets/partitioner/iid_partitioner.py` | Adds `shuffle`/`seed` parameters and per-instance caching of the shuffled dataset used for sharding. |
| `datasets/flwr_datasets/partitioner/iid_partitioner_test.py` | Adds regression and determinism tests for default contiguous behavior and shuffled sharding. |
| `datasets/flwr_datasets/metrics/utils.py` | Implements Hellinger and Jensen–Shannon distance utilities (including optional binning) and related helpers. |
| `datasets/flwr_datasets/metrics/utils_test.py` | Adds unit tests validating distance values, binning behavior, and input validation. |
| `datasets/flwr_datasets/metrics/__init__.py` | Exposes the new metric functions as part of the public `flwr_datasets.metrics` API. |
| `datasets/docs/source/index.rst` | Updates feature list and `IidPartitioner` signature in docs landing page. |
| `datasets/docs/source/how-to-use-with-local-data.rst` | Adds guidance for shuffling sorted local datasets via `IidPartitioner(shuffle=True, seed=...)`. |

WilliamLindskog · 2026-06-16T02:38:09Z

Update: I split this PR in two so that each part is easier to review.

fix(datasets): Add shuffle option to IidPartitioner #7385 is now shuffle-only: IidPartitioner(shuffle, seed) plus docs/tests for sorted local datasets.
The partition-skew metrics work moved to draft follow-up feat(datasets): Add partition skew distance metrics #7389, stacked on this branch so its diff is metrics-only.

Focused validation passed for the touched IID partitioner files: pytest, ruff, mypy, black --check, and git diff --check.

jafermarq · 2026-06-16T08:01:45Z

+        if not self._shuffle:
+            return self.dataset
+        if self._shuffled_dataset is None:
+            self._shuffled_dataset = self.dataset.shuffle(seed=self._seed)


does this mean we have two copies of the dataset?

WilliamLindskog · Jun 16, 2026

Good question. Dataset.shuffle(...) returns another Hugging Face Dataset object with shuffled indices/cache metadata rather than eagerly duplicating all row data. So this keeps a second dataset object around, but it should not be a full in-memory copy of the underlying dataset. The cache is intentional: it makes repeated load_partition calls use the same shuffled order, which matters most when seed=None.
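The caching pattern discussed in this thread can be sketched without Hugging Face datasets. The class below is a toy stand-in, not the flwr_datasets implementation: the class name, the list-backed data, and the equal-size slicing are all assumptions made for illustration. It caches a permutation of row indices rather than a copy of the rows, which is analogous to the index-based shuffling described in the reply.

```python
import random


class ShuffleCachingPartitioner:
    """Toy stand-in for the caching pattern; not the flwr_datasets class."""

    def __init__(self, data, num_partitions, shuffle=False, seed=None):
        self._data = data  # the rows are referenced, never copied
        self._num_partitions = num_partitions
        self._shuffle = shuffle
        self._seed = seed
        self._order = None  # cached permutation of row indices

    def _row_order(self):
        if not self._shuffle:
            return range(len(self._data))
        if self._order is None:
            # Computed once per instance, so the order stays fixed even when
            # seed=None makes the permutation itself unpredictable.
            order = list(range(len(self._data)))
            random.Random(self._seed).shuffle(order)
            self._order = order
        return self._order

    def load_partition(self, partition_id):
        # Simplification: assumes len(data) is divisible by num_partitions.
        size = len(self._data) // self._num_partitions
        order = self._row_order()
        start = partition_id * size
        return [self._data[i] for i in order[start:start + size]]


rows = list(range(10))
partitioner = ShuffleCachingPartitioner(rows, num_partitions=2, shuffle=True)
first_load = partitioner.load_partition(0)
second_load = partitioner.load_partition(0)
print(first_load == second_load)
```

Without the cached order, a partitioner created with seed=None would reshuffle on every call, and two loads of the same partition ID could return different rows.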

William Lindskog-Munzing added 2 commits June 15, 2026 16:38

fix(datasets): add shuffle option to IidPartitioner

6fb43ae

feat(datasets): add partition skew distance metrics

08b8595

Copilot AI review requested due to automatic review settings June 15, 2026 20:41

Copilot started reviewing on behalf of WilliamLindskog June 15, 2026 20:41

Copilot AI reviewed Jun 15, 2026

WilliamLindskog changed the title ~~fix(datasets): add IID shuffle and partition skew metrics~~ fix(datasets): Add IID shuffle and partition skew metrics Jun 15, 2026

github-actions Bot added the Maintainer label ("Used to determine what PRs (mainly) come from Flower maintainers.") Jun 15, 2026

WilliamLindskog and others added 2 commits June 15, 2026 22:08

Merge branch 'main' into fix/flwr-datasets-iid-metrics

5c6cf8a

Revert "feat(datasets): add partition skew distance metrics"

547e4de

This reverts commit 08b8595.

WilliamLindskog changed the title ~~fix(datasets): Add IID shuffle and partition skew metrics~~ fix(datasets): Add shuffle option to IidPartitioner Jun 16, 2026

WilliamLindskog mentioned this pull request Jun 16, 2026

feat(datasets): Add partition skew distance metrics #7389

Draft

WilliamLindskog marked this pull request as ready for review June 16, 2026 03:15

WilliamLindskog requested review from danieljanes, jafermarq and tanertopal as code owners June 16, 2026 03:15

jafermarq reviewed Jun 16, 2026

Comment thread datasets/docs/source/index.rst Outdated

jafermarq reviewed Jun 16, 2026

Comment thread datasets/flwr_datasets/partitioner/iid_partitioner.py Outdated

jafermarq reviewed Jun 16, 2026

Comment thread datasets/README.md Outdated

jafermarq reviewed Jun 16, 2026

Comment thread datasets/README.md Outdated

William Lindskog-Munzing and others added 3 commits June 16, 2026 08:07

docs(datasets): simplify IID partitioner guidance

77f73dc

Merge branch 'main' into fix/flwr-datasets-iid-metrics

2a61fb2

Merge branch 'main' into fix/flwr-datasets-iid-metrics

b4aaeb4

WilliamLindskog wants to merge 7 commits into main from fix/flwr-datasets-iid-metrics

3 participants
